31 research outputs found

    A Framework for Aggregating Private and Public Web Archives

    Full text link
    Personal and private Web archives are proliferating due to the increase in the tools to create them and the realization that Internet Archive and other public Web archives are unable to capture personalized (e.g., Facebook) and private (e.g., banking) Web pages. We introduce a framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potential sensitive information contained in private captures. We amend Memento syntax and semantics to allow TimeMap enrichment to account for additional attributes to be expressed inclusive of the requirements for dereferencing private Web archive captures. We provide a method to involve the user further in the negotiation of archival captures in dimensions beyond time. We introduce a model for archival querying precedence and short-circuiting, as needed when aggregating private and personal Web archive captures with those from public Web archives through Memento. Negotiation of this sort is novel to Web archiving and allows for the more seamless aggregation of various types of Web archives to convey a more accurate picture of the past Web.Comment: Preprint version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) full paper, accessible at the DO

    HTTP Mailbox - Asynchronous Restful Communication

    Get PDF
    Traditionally, general web services used only the GET and POST methods of HTTP while several other HTTP methods like PUT, PATCH, and DELETE were rarely utilized. Additionally, the Web was mainly navigated by humans using web browsers and clicking on hyperlinks or submitting HTML forms. Clicking on a link is always a GET request while HTML forms only allow GET and POST methods. Recently, several web frameworks/libraries have started supporting RESTful web services through APIs. To support HTTP methods other than GET and POST in browsers, these frameworks have used hidden HTML form fields as a workaround to convey the desired HTTP method to the server application. In such cases, the web server is unaware of the intended HTTP method because it receives the request as POST. Middleware between the web server and the application may override the HTTP method based on special hidden form field values. Unavailability of the servers is another factor that affects the communication. Because of the stateless and synchronous nature of HTTP, a client must wait for the server to be available to perform the task and respond to the request. Browser-based communication also suffers from cross-origin restrictions for security reasons. We describe HTTP Mailbox, a mechanism to enable RESTful HTTP communication in an asynchronous mode with a full range of HTTP methods otherwise unavailable to standard clients and servers. HTTP Mailbox also allows for multicast semantics via HTTP. We evaluate a reference implementation using ApacheBench (a server stress testing tool) demonstrating high throughput (on 1,000 concurrent requests) and a systemic error rate of 0.01%. Finally, we demonstrate our HTTP Mailbox implementation in a human-assisted Web preservation application called “Preserve Me! and a visualization application called Preserve Me! Viz

    MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing

    Get PDF
    With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings as well as to support routing of requests in Memento aggregators. A memento is a past version of a web page and a Memento aggregator is a tool or service that aggregates mementos from many different web archives. To save resources, the Memento aggregator should only poll the archives that are likely to have a copy of the requested Uniform Resource Identifier (URI). Using the Crawler Index (CDX), we generate profiles of the archives that summarize their holdings and use them to inform routing of the Memento aggregator’s URI requests. Additionally, we use full text search (when available) or sample URI lookups to build an understanding of an archive’s holdings. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. For evaluation we used CDX files from Archive-It, UK Web Archive, Stanford Web Archive Portal, and Arquivo.pt. Moreover, we used web server access log files from the Internet Archive’s Wayback Machine, UK Web Archive, Arquivo.pt, LANL’s Memento Proxy, and ODU’s MemGator Server. In addition, we utilized historical dataset of URIs from DMOZ. In early experiments with various URI-based static profiling policies we successfully identified about 78% of the URIs that were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile and 94% URIs with less than 10% relative cost without any false negatives. In another experiment we found that we can correctly route 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile. We created MementoMap, a framework that allows web archives and third parties to express holdings and/or voids of an archive of any size with varying levels of details to fulfil various application needs. Our archive profiling framework enables tools and services to predict and rank archives where mementos of a requested URI are likely to be present. In static profiling policies we predefined the maximum depth of host and path segments of URIs for each policy that are used as URI keys. This gave us a good baseline for evaluation, but was not suitable for merging profiles with different policies. Later, we introduced a more flexible means to represent URI keys that uses wildcard characters to indicate whether a URI key was truncated. Moreover, we developed an algorithm to rollup URI keys dynamically at arbitrary depths when sufficient archiving activity is detected under certain URI prefixes. In an experiment with dynamic profiling of archival holdings we found that a MementoMap of less than 1.5% relative cost can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive without any false negatives (i.e., 100% recall). In addition, we separately evaluated archival voids based on the most frequently accessed resources in the access log and found that we could have avoided more than 8% of the false positives without introducing any false negatives. We defined a routing score that can be used for Memento routing. Using a cut-off threshold technique on our routing score we achieved over 96% accuracy if we accept about 89% recall and for a recall of 99% we managed to get about 68% accuracy, which translates to about 72% saving in wasted lookup requests in our Memento aggregator. Moreover, when using top-k archives based on our routing score for routing and choosing only the topmost archive, we missed only about 8% of the sample URIs that are present in at least one archive, but when we selected top-2 archives, we missed less than 2% of these URIs. We also evaluated a machine learning-based routing approach, which resulted in an overall better accuracy, but poorer recall due to low prevalence of the sample lookup URI dataset in different web archives. We contributed various algorithms, such as a space and time efficient approach to ingest large lists of URIs to generate MementoMaps and a Random Searcher Model to discover samples of holdings of web archives. We contributed numerous tools to support various aspects of web archiving and replay, such as MemGator (a Memento aggregator), Inter- Planetary Wayback (a novel archival replay system), Reconstructive (a client-side request rerouting ServiceWorker), and AccessLog Parser. Moreover, this work yielded a file format specification draft called Unified Key Value Store (UKVS) that we use for serialization and dissemination of MementoMaps. It is a flexible and extensible file format that allows easy interactions with Unix text processing tools. UKVS can be used in many applications beyond MementoMaps

    HTTP Mailbox - Asynchronous RESTful Communication

    Full text link
    We describe HTTP Mailbox, a mechanism to enable RESTful HTTP communication in an asynchronous mode with a full range of HTTP methods otherwise unavailable to standard clients and servers. HTTP Mailbox allows for broadcast and multicast semantics via HTTP. We evaluate a reference implementation using ApacheBench (a server stress testing tool) demonstrating high throughput (on 1,000 concurrent requests) and a systemic error rate of 0.01%. Finally, we demonstrate our HTTP Mailbox implementation in a human assisted web preservation application called "Preserve Me".Comment: 13 pages, 6 figures, 8 code blocks, 3 equations, and 3 table

    MementoMap: An Archive Profile Dissemination Framework

    Get PDF
    We introduce MementoMap, a framework to express and disseminate holdings of web archives (archive profiles) by themselves or third parties. The framework allows arbitrary, flexible, and dynamic levels of details in its entries that fit the needs of archives of different scales. This enables Memento aggregators to significantly reduce wasted traffic to web archives

    Impact of HTTP Cookie Violations in Web Archives

    Get PDF
    Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time, but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration time and archival replay systems account for values in the Vary header along with URIs
    corecore